Language Identification Using Minimum Linguistic Information
Abstract
Automatic spoken language identification is the problem of identifying the language being spoken from a sample of speech by an unknown speaker. Current language identification systems vary in their complexity. The systems that use higher-level information perform best. However, that information is hard to collect for each new language. In this work, we present a state-of-the-art language identification system that uses very little linguistic information and is therefore easily extended to new languages. In fact, the presented system needs only one language-specific phone recogniser (in our case the Portuguese one), and is trained with speech from each of the other languages. We studied the problem of language identification in the context of the European languages (including, for the first time, European Portuguese), which allowed us to study the effect of language proximity in Indo-European languages. The results show that language proximity has a significant impact on the identification of some languages. On the SpeechDat-M corpus, with 6 European languages (English, French, German, Italian, Portuguese and Spanish), our system achieved an identification rate of about 80% on 5-second utterances.
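The abstract does not say how the output of the single phone recogniser is modelled for each language. A common arrangement for this kind of system is to decode every utterance into a phone sequence with the one recogniser, train a phone n-gram model per target language on those sequences, and pick the language whose model scores a test sequence highest. The sketch below illustrates that idea only; the bigram order, the add-one smoothing, the function names, and the toy phone strings are all assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter

def train_bigram_model(phone_sequences):
    """Count phone bigrams and their left-context phones for one language."""
    bigrams, contexts = Counter(), Counter()
    for seq in phone_sequences:
        for prev, cur in zip(seq, seq[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts

def log_likelihood(seq, model, vocab_size):
    """Score a phone sequence under a bigram model with add-one smoothing."""
    bigrams, contexts = model
    total = 0.0
    for prev, cur in zip(seq, seq[1:]):
        prob = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
        total += math.log(prob)
    return total

def identify(seq, models, vocab_size):
    """Return the language whose model gives the sequence the highest score."""
    return max(models, key=lambda lang: log_likelihood(seq, models[lang], vocab_size))

# Hypothetical phone sequences, standing in for the output of the single
# phone recogniser on training speech from two made-up languages.
training = {
    "lang_a": [["p", "a", "t", "a"], ["t", "a", "p", "a"]],
    "lang_b": [["s", "i", "k", "i"], ["k", "i", "s", "i"]],
}
vocab_size = len({p for seqs in training.values() for s in seqs for p in s})
models = {lang: train_bigram_model(seqs) for lang, seqs in training.items()}

print(identify(["p", "a", "t", "a"], models, vocab_size))
```

Under these assumptions, adding a language requires only transcribed-by-machine phone sequences from its speech, not a new recogniser or hand-built linguistic resources, which is consistent with the extensibility the abstract describes.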
Related resources
The Role of Non-Linguistic Variables in Production of Complex Linguistic Structures by Hearing-Impaired Children
Objectives: Language development is often much slower in hearing-impaired children than in their normal-hearing peers. Hearing impairment during childhood affects all aspects of speech production and language acquisition. It seems that hearing-impaired people suffer from language and speech impairments, such as difficulty producing complex linguistic structures. The purpose of this study is to determine ...
Language identification of code switching Malay-English words using syllable structure information
This paper introduces a language identification approach using syllable structure information. We also review and compare other approaches. Most of these approaches use linguistic information for language identification, including Malay affixation information, an English vocabulary list, alphabet n-grams, and grapheme n-grams. The approach using syllable structur...
Using a Single Framework for Computational Modeling of Linguistic Similarity for Solving Many NLP Problems
In this paper we show how a single framework for computational modeling of linguistic similarity can be used for solving many problems. Similarity can be measured within or across languages and at various linguistic levels. We model linguistic similarity in three stages: surface similarity, contextual similarity and distributional similarity. We have successfully used the framework for several ...
A Study of the Relationship between Acoustic Features of “bæle” and the Paralinguistic Information
Language users rely on special phonetic tools to communicate linguistic information as well as emotional and paralinguistic information in daily conversation. Because they help convey semantic information to listeners, prosodic features form an essential part of linguistic behaviour; manipulating them can potentially play an important role in transmitt...
Advancing Linguistic Features and Insights by Label-informed Feature Grouping: An Exploration in the Context of Native Language Identification
We propose a hierarchical clustering approach designed to group linguistic features for supervised machine learning that is inspired by variationist linguistics. The method makes it possible to abstract away from the individual feature occurrences by grouping features together that behave alike with respect to the target class, thus providing a new, more general perspective on the data. On the ...